Abstract



In this article, we study shape fitting problems, $\epsilon$-coresets, and total sensitivity. We focus on the
$(j, k)$-projective clustering problems, including $k$-median/$k$-means, $k$-line clustering, $j$-subspace
approximation, and the integer $(j, k)$-projective clustering problem. We derive upper bounds of
total sensitivities for these problems, and obtain $\epsilon$-coresets using these upper bounds. Using a
dimension-reduction type argument, we are able to greatly simplify earlier results on total sensitivity
for the $k$-median/$k$-means clustering problems, and obtain positively-weighted $\epsilon$-coresets
for several variants of the $(j, k)$-projective clustering problem. We also extend an earlier result
on $\epsilon$-coresets for the integer $(j, k)$-projective clustering problem in fixed dimension to the case of
high dimension.


1998 ACM Subject Classification F.2.2 Analysis of Algorithms and Problem Complexity 
Keywords and phrases Coresets, shape fitting, k-means, subspace approximation 
Digital Object Identifier 10.4230/LIPIcs.xxx.yyy.p






1 Introduction



In this article, we study shape fitting problems, coresets, and in particular, total sensitivity.




A shape fitting problem is specified by a triple $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathbb{R}^d$ is the $d$-dimensional
Euclidean space, $\mathcal{F}$ is a family of subsets of $\mathbb{R}^d$, and $\mathrm{dist} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$ is a continuous
function that we will refer to as a distance function. We also assume that (a) $\mathrm{dist}(p, q) = 0$ if
and only if $p = q$, and (b) $\mathrm{dist}(p, q) = \mathrm{dist}(q, p)$. We refer to each $F \in \mathcal{F}$ as a shape, and we
require each shape $F$ to be a non-empty, closed subset of $\mathbb{R}^d$. We define the distance of a
point $p \in \mathbb{R}^d$ to a shape $F \in \mathcal{F}$ to be $\mathrm{dist}(p, F) = \min_{q \in F} \mathrm{dist}(p, q)$. An instance of a shape
fitting problem is specified by a finite point set $P \subset \mathbb{R}^d$. We slightly abuse notation and use
$\mathrm{dist}(P, F)$ to denote $\sum_{p \in P} \mathrm{dist}(p, F)$ when $P$ is a set of points in $\mathbb{R}^d$. The goal is to find a
shape which best fits $P$, that is, a shape minimizing $\sum_{p \in P} \mathrm{dist}(p, F)$ over all shapes $F \in \mathcal{F}$.
This is referred to as the $L_1$ fitting problem, which is the main focus of this paper. In the
$L_\infty$ fitting problem, we seek to find a shape $F \in \mathcal{F}$ minimizing $\max_{p \in P} \mathrm{dist}(p, F)$.

In this paper, we focus on the $(j, k)$-projective clustering problem. Given non-negative
integers $j$ and $k$, the family of shapes is the set of $k$-tuples of affine $j$-subspaces (that is,
$j$-flats) in $\mathbb{R}^d$. More precisely, each shape is the union of some $k$ $j$-flats. The underlying
distance function is usually the $z$-th power of the Euclidean distance, for a positive real number
$z$. When $j = 0$, $\mathcal{F}$ is the set of all $k$-point sets of $\mathbb{R}^d$, so the $(0, k)$-projective clustering
problem is the $k$-median clustering problem when the distance function is the Euclidean
distance, and it is the $k$-means clustering problem when the distance function is the square of
the Euclidean distance; when $j = 1$, the family of shapes is the set of $k$-tuples of lines in $\mathbb{R}^d$;
when $k = 1$, $(j, 1)$-projective clustering is the subspace approximation problem, where the
family of shapes is the set of $j$-flats. Other than these projective clustering problems where
$j$ or $k$ is set to specific values, another variant of the $(j, k)$-projective clustering problem
is the integer $(j, k)$-projective clustering problem, where we assume that the input points
have integer coordinates (but there is no restriction on $j$ and $k$), and the magnitude of these
coordinates is at most $n^c$, where $n$ is the number of input points and $c > 0$ is some constant.
That is, the points are in a polynomially large integer grid.
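
To make the distance to a shape concrete, the following is a minimal Python sketch of the distance from a point to a union of $j$-flats under the $z$-th power of the Euclidean distance. The representation of a flat as an (origin, orthonormal basis) pair and the function names are assumptions made for this illustration; they are not notation from the paper.

```python
import numpy as np

def dist_to_flat(p, origin, basis, z=1.0):
    """z-th power of the Euclidean distance from p to the j-flat
    origin + span(rows of basis).

    Assumption of this sketch: `basis` has shape (j, d) with orthonormal
    rows. A (0, d) basis describes a single point (j = 0).
    """
    v = p - origin
    # Remove the component of v that lies inside the flat.
    residual = v - basis.T @ (basis @ v)
    return np.linalg.norm(residual) ** z

def dist_to_shape(p, flats, z=1.0):
    """Distance from p to a shape given as a union of k flats:
    the minimum over the individual flats."""
    return min(dist_to_flat(p, origin, basis, z) for origin, basis in flats)
```

With `basis` of shape `(0, d)` this reduces to the $k$-median ($z = 1$) or $k$-means ($z = 2$) distance to the nearest center; with one-row bases it is the $k$-line distance.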

An $\epsilon$-coreset for an instance $P$ of a shape fitting problem is a weighted set $S$, such that for
any shape $F \in \mathcal{F}$, the summation of distances from points in $P$ approximates the weighted
summation of the distances from points in $S$ up to a multiplicative factor of $(1 \pm \epsilon)$. A
more precise definition (Definition 1) follows later. Coresets can be considered as a succinct
representation of the point set; in particular, in order to obtain a $(1 + \epsilon)$-approximation
solution fitting $P$, it is sufficient to find a $(1 + \epsilon)$-approximation solution for the coreset
$S$. One usually seeks a small coreset, whose size $|S|$ is independent of the cardinality of $P$.
Coresets of size $o(n)$ for the $(j, k)$-projective clustering problem for general $j$ and $k$ are not
known to exist. However, the $k$-median/$k$-means clustering, $k$-line clustering, $j$-subspace
approximation, and integer $(j, k)$-projective clustering problems admit small coresets.

Langberg and Schulman [10] introduced a general approach to coresets via the notion
of sensitivity of points in a point set, which provides a natural way to set up a probability
distribution $\Pr(\cdot)$ on $P$. Roughly speaking, the sensitivity of a point with respect to a
point set measures the importance of the point, in terms of fitting shapes in the given
family of shapes $\mathcal{F}$. Formally, the sensitivity of point $p$ in a point set $P$ is defined by
$\sigma_P(p) := \sup_{F \in \mathcal{F}} \mathrm{dist}(p, F)/\mathrm{dist}(P, F)$. (In the degenerate case where the denominator in
the ratio is 0, the numerator is also 0, and we take the ratio to be 0; the reader should
feel free to ignore this technicality.) The total sensitivity of a point set $P$ is defined by
$\mathfrak{S}_P := \sum_{p \in P} \sigma_P(p)$. The nice property of quantifying the "importance" of a point in a point
set is that for any $F \in \mathcal{F}$, $\mathrm{dist}(p, F)/\mathrm{dist}(P, F) \le \sigma_P(p)$. Setting the probability of selecting
$p$ to be $\sigma_P(p)/\mathfrak{S}_P$, and the weight of $p$ to be $\mathfrak{S}_P/\sigma_P(p)$, for all $p \in P$, one can show that the
variance of the sampling scheme is $O((\mathfrak{S}_P)^2)$. When $\mathfrak{S}_P$ is $o(n)$ (for example, a constant or
logarithmic in terms of $n = |P|$), one can obtain an $\epsilon$-coreset by sampling a small number
of points. Langberg and Schulman [10] show that the total sensitivity of any (arbitrarily
large) point set $P \subset \mathbb{R}^d$ for the $k$-median/$k$-means clustering problem is a constant, depending
only on $k$, independent of the cardinality of $P$ and the dimension of the Euclidean space
where $P$ and $\mathcal{F}$ are from. Using this, they derived a coreset for these problems with size
depending polynomially on $d$ and $k$ and independent of $n$. Their work can be seen as evolving
from earlier work on coresets for the $k$-median/$k$-means and related problems via other low
variance sampling schemes [3J |H [71 [S] . 

Feldman and Langberg [B] relate the notion of an $\epsilon$-coreset with the well-studied notion
of an $\epsilon$-approximation of range spaces. They use a "functional representation" of points:
consider a family of functions $\mathcal{P} = \{f_p(\cdot) \mid p \in P\}$, where each point $p$ is associated with a
function $f_p : X \to \mathbb{R}$. The target here is to pick a small subset $S \subseteq P$ of points, and assign
weights appropriately, so that $\sum_{p \in S} w_p f_p(x)$ approximates $\sum_{p \in P} f_p(x)$ at every $x \in X$.
When $X$ is $\mathcal{F}$ and $f_p(F) = \mathrm{dist}(p, F)$, this is just the original $\epsilon$-coreset for $P$. However, $f_p(\cdot)$
can be any other function defined over $\mathcal{F}$; for example, $f_p(\cdot)$ can be the "residue distance"
of $p$, i.e., $f_p(F) = |\mathrm{dist}(p, F) - \mathrm{dist}(p', F)|$, where $p'$ is the projection of $p$ on the optimum
shape $F^*$ fitting $P$. The definitions of sensitivities and total sensitivity easily carry over
in this setting: $\sigma_{\mathcal{P}}(f_p) = \sup_{x \in X} f_p(x) / \sum_{f_q \in \mathcal{P}} f_q(x)$ (which coincides with $\sigma_P(p)$ when
$f_p(\cdot)$ is $\mathrm{dist}(p, \cdot)$), and $\mathfrak{S}_{\mathcal{P}} = \sum_{f_p \in \mathcal{P}} \sigma_{\mathcal{P}}(f_p)$ (which coincides with $\mathfrak{S}_P$ similarly). One of
the results in [5] is that an approximating subset $S \subseteq P$ can be computed with the size
$|S|$ upper bounded by the product of two quantities: $(\mathfrak{S}_{\mathcal{P}})^2$, and another parameter, the
"dimension" (see Definition 3) of a certain range space induced by $\mathcal{P}$, denoted $\dim(\mathcal{P})$. We
remark that $\dim(\mathcal{P})$ depends on $d$, which is the dimension of the Euclidean space where $P$ is
from, and some other parameters related to $X$; when $X$ is the family of shapes for the
$(j, k)$-projective clustering problem, $\dim(\mathcal{P})$ also depends on $j$ and $k$. This connection allows
them to use many results from the well-studied area of $\epsilon$-approximation of range spaces (such
as deterministic construction of small $\epsilon$-approximations of range spaces), thus constructing
smaller coresets deterministically, and removes some routine analysis in the traditional way of
obtaining coresets via random sampling.

1.1 Our Results 

In this article, we prove upper bounds of total sensitivities for the $(j, k)$-projective clustering
problems. In particular, we show a careful analysis of computing total sensitivities for shape
fitting problems in high dimension. Total sensitivity $\mathfrak{S}_P$ for a point set $P \subset \mathbb{R}^d$ may depend
on $d$: consider the shape fitting problem where the family of shapes is the set of hyperplanes,
and $P$ is a point set of size $d$ in general position. Then clearly $\sigma_P(p) = 1$ (since there always
exists a hyperplane containing all $d - 1$ points other than $p$), so $\mathfrak{S}_P = d$.

One question that arises naturally is whether the dependence of the total sensitivity
on the dimension $d$ is essential. To answer this question, we show that if the distance
function is Euclidean distance, or the $z$-th power of Euclidean distance for $z \in [1, \infty)$, then
the total sensitivity function of a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$ in the high dimensional
space $\mathbb{R}^d$ is roughly the same as that of the low-dimensional variant $(\mathbb{R}^{d'}, \mathcal{F}', \mathrm{dist})$, where
$d'$ is the "intrinsic" dimension of the shapes in $\mathcal{F}$, and $\mathcal{F}'$ consists of shapes contained in
the low dimensional space $\mathbb{R}^{d'}$. A reification of this statement is that the total sensitivity
function of the $(j, k)$-projective clustering is independent of $d$. For the $(j, k)$-projective
clustering problems, the shapes are intrinsically low dimensional: each $k$-tuple of $j$-flats is
contained in a subspace of dimension at most $k(j + 1)$. As we will see, the total sensitivity
function for $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the family of $k$-tuples of $j$-flats in $\mathbb{R}^d$, is of the
same magnitude as the total sensitivity function of $(\mathbb{R}^{f(j,k)}, \mathcal{F}', \mathrm{dist})$, where $f(j, k)$ is a function
of $j$ and $k$ (which is independent of $d$), and $\mathcal{F}'$ is the family of $k$-tuples of $j$-flats in $\mathbb{R}^{f(j,k)}$.

We sketch our approach to upper bound the total sensitivity of the $(j, k)$-projective
clustering. We first make the observation (Theorem 7 below) that the total sensitivity of
a point set $P$ is upper bounded by a constant multiple of the total sensitivity of $P' =
\mathrm{proj}(P, F^*)$, which is the projection of $P$ on the optimum shape $F^*$ fitting $P$ in $\mathcal{F}$. The
computation of total sensitivity of $P'$ is very simple in certain cases; for example, for $k$-median
clustering, $P'$ is a multi-set which contains $k$ distinct points, whose total sensitivity can be
directly bounded by $k$. Therefore, we are able to greatly simplify the proofs in [10]. Another
more important use of this observation is that it allows us to get a dimension-reduction
type result for the $(j, k)$-projective clustering problems: note that although the point set
and the shapes might be in a high dimensional space $\mathbb{R}^d$, the projected point set $P'$ lies in a
subspace of dimension $(j + 1)k$ (since each $k$-tuple of $j$-flats is contained in a subspace of
dimension at most $(j + 1)k$), which is small under the assumption that both $j$ and $k$ are
constant. Therefore, $\mathfrak{S}_P$, which usually depends on $d$ if one directly computes it in a high
dimensional space, depends only on $j$ and $k$, since $\mathfrak{S}_P$ is $O(\mathfrak{S}_{P'})$.

Our method for bounding the total sensitivity directly translates into a template for
computing $\epsilon$-coresets:

1. Compute $F^*$, the optimal shape fitting $P$. (It suffices to use an approximately optimal
shape.) Compute $P'$, the projection of $P$ onto $F^*$.

2. Compute a bound on the sensitivity of each point in $P'$ with respect to $P'$. Since the
ambient dimension is $O(jk)$, we may use a method that yields bounds on $\mathfrak{S}_{P'}$ with
dependence on the ambient dimension. Use Theorem 7 to translate this into a bound for
$\sigma_P(p)$ for each $p \in P$.

3. Sample points from $P$ with probabilities proportional to $\sigma_P(p)$ to obtain a coreset, as
described in [TtJllS].
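
The three steps above can be sketched for $k$-median ($z = 1$, $\alpha = 1$) as follows. This is an illustrative Python sketch under stated assumptions, not the construction analysed in the cited works: the centers standing in for $F^*$ are supplied by the caller, the per-point bound $\mathrm{dist}(p, p')/\mathrm{dist}(P, F^*) + 2/|P_i|$ is the one suggested by the proofs of Theorems 7 and 9 (and is only guaranteed for optimal centers), and the weight $1/(m \cdot \text{probability})$ is the standard importance-sampling normalisation for $m$ draws. All function names are hypothetical.

```python
import numpy as np

def kmedian_sensitivity_bounds(points, centers):
    """Upper bounds on sigma_P(p) for k-median, one per point.

    `centers` plays the role of F*; the bound is only guaranteed when the
    centers are optimal, which this sketch does not verify.
    """
    # Pairwise Euclidean distances, shape (n, k).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                    # index of p' = proj(p, F*)
    to_proj = d[np.arange(len(points)), nearest]  # dist(p, p')
    cost = to_proj.sum()                          # dist(P, F*)
    sizes = np.bincount(nearest, minlength=len(centers))  # |P_i|
    if cost == 0:
        # P = P': sigma_P(p) = sigma_{P'}(p') <= 1 / |P_i|.
        return 1.0 / sizes[nearest]
    return to_proj / cost + 2.0 / sizes[nearest]

def sensitivity_sample(points, bounds, m, rng):
    """Draw m indices with probability proportional to the bounds and
    return them with positive importance weights."""
    probs = bounds / bounds.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

When the cost is positive, the bounds sum to $1 + 2k'$, where $k' \le k$ is the number of non-empty clusters, which matches the $2k + 1$ bound for $k$-median. The sampled points are a subset of the input and all weights are positive.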

We now point out the difference between our usage of total sensitivity in the construction
of coresets and the method in [B]. The construction of coresets in [B] may also be considered
as based on total sensitivity, however in a very different way:

1. First obtain a small weighted point set $S \subseteq P$, such that $\mathrm{dist}(P, F) - \mathrm{dist}(P', F)$ is
approximately the same as $\mathrm{dist}(S, F) - \mathrm{dist}(S', F)$ (where $S'$ is $\mathrm{proj}(S, F^*)$) for every $F \in \mathcal{F}$.

2. Then compute an $\epsilon$-coreset $Q' \subseteq P'$ for the projected point set $P'$, that is, $\mathrm{dist}(Q', F)$
approximates $\mathrm{dist}(P', F)$ for every $F \in \mathcal{F}$. (Since $P'$ is from a low-dimensional subspace,
the ambient dimension is small, and the computation can exploit this.)

Therefore, for each $F \in \mathcal{F}$, $\mathrm{dist}(P, F) = (\mathrm{dist}(P, F) - \mathrm{dist}(P', F)) + \mathrm{dist}(P', F) \approx
(\mathrm{dist}(S, F) - \mathrm{dist}(S', F)) + \mathrm{dist}(Q', F)$.

Thus the weighted set $Q' \cup S \cup S'$ is a coreset for $P$, but notice that the points in $S'$
have negative weights. In contrast, the weights of points in the coreset in our construction 
are positive. The advantage of getting coresets with positive weights is that in order to get 
an approximate solution to the shape fitting problem, we may run algorithms or heuristics 
developed for the shape fitting problem on the coreset, such as pQ. When points have negative 
weights, on the other hand, some of these heuristics do not work or need to be modified 
appropriately. 

Another useful feature of the coresets obtained via our results is that the coreset is a 
subset of the original point set. When each point stands for a data item, the coreset inherits 
a natural interpretation. See [IT] for a discussion of this issue in a broader context. 

The sizes of the coresets in this paper are somewhat larger than the size of coresets in [B].
Roughly speaking, the size of the coreset in [B] is $f_1(d) + f_2(j, k)$, where $f_1(d)$ (respectively
$f_2(j, k)$) is a function depending only on $d$ (respectively $j$ and $k$) for the $(j, k)$-projective
clustering problem, while the coreset size in our paper is $f_1(d) \cdot f_2(j, k)$.

Organization of this paper: In this article, we focus on the construction that establishes
small total sensitivity for various shape fitting problems, and the size of the resulting coreset.
For clarity, we omit the description of algorithms for computing such bounds on sensitivity.
Efficient algorithms result from the construction using a methodology that is now well-understood.
Also, because the weights for points in the coreset are nonnegative, the coreset
lends itself to streaming settings, where points arrive one by one as $p_1, p_2, \ldots$ [S][S]. In Section
2, we present necessary definitions used throughout this article, and summarize related results
from [B] and [12]. In Section 3, we prove the upper bound of total sensitivity of an instance of
a shape fitting problem in high dimension by its low dimensional projection. In Sections 4, 5,
6, and 7, we apply the upper bound from Section 3 to $k$-median/$k$-means clustering, $k$-line
clustering, $j$-subspace approximation, and the integer $(j, k)$-projective clustering problem,
respectively, to obtain upper bounds for their total sensitivities, and the size of the resulting
$\epsilon$-coresets.

2 Preliminaries 

In this section, we formally define some of the concepts studied in this article, and state 
crucial results from previous work. We begin by defining an $\epsilon$-coreset.

► Definition 1 ($\epsilon$-coreset of a shape fitting problem). Given an instance $P \subset \mathbb{R}^d$ of a
shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, and $\epsilon \in [0, 1]$, an $\epsilon$-coreset of $P$ is a (weighted) set
$S \subseteq P$, together with a weight function $w : S \to \mathbb{R}^+$, such that for any shape $F$ in $\mathcal{F}$,
it holds that $|\mathrm{dist}(P, F) - \mathrm{dist}(S, F)| \le \epsilon \cdot \mathrm{dist}(P, F)$, where by definition, $\mathrm{dist}(P, F) =
\sum_{p \in P} \mathrm{dist}(p, F)$, and $\mathrm{dist}(S, F) = \sum_{p \in S} w(p)\,\mathrm{dist}(p, F)$. The size of the weighted coreset $S$
is defined to be $|S|$.
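
As an illustration of the definition, the following Python sketch measures the worst relative error $|\mathrm{dist}(P, F) - \mathrm{dist}(S, F)|/\mathrm{dist}(P, F)$ of a weighted set over a finite list of candidate shapes, for the $k$-median case where each shape is a set of centers. The function names are hypothetical, and checking finitely many shapes is only a necessary test: the definition quantifies over every shape in the family.

```python
import numpy as np

def kmedian_cost(points, weights, centers):
    """Weighted sum of Euclidean distances to the nearest center."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(np.dot(weights, d.min(axis=1)))

def max_relative_error(P, S, w, candidate_centers):
    """Largest |dist(P,F) - dist(S,F)| / dist(P,F) over the candidates.

    S is an epsilon-coreset only if this value is at most epsilon for
    every shape, not just for the candidates tried here.
    """
    worst = 0.0
    for C in candidate_centers:
        full = kmedian_cost(P, np.ones(len(P)), C)
        approx = kmedian_cost(S, w, C)
        if full > 0:
            worst = max(worst, abs(full - approx) / full)
        elif approx > 0:
            worst = float("inf")
    return worst
```

For example, merging duplicate points into one point whose weight is the multiplicity gives error 0 on every candidate, while dropping a far-away point generally does not.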

We note that in the literature, the requirement that the weights be non-negative, as well 
as the requirement that the coreset S be a subset of the original instance P, are sometimes 
relaxed. We include these requirements in the definition to emphasize that the coresets 
constructed here do satisfy them. We now define the sensitivities of points in a shape fitting 
instance, and the total sensitivity of the instance. 

► Definition 2 (Sensitivity of a shape fitting instance [10]). Given an instance $P \subset \mathbb{R}^d$ of
a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, the sensitivity of a point $p$ in $P$ is $\sigma_P(p) := \inf\{\beta \ge
0 \mid \mathrm{dist}(p, F) \le \beta\,\mathrm{dist}(P, F), \forall F \in \mathcal{F}\}$.

Note that an equivalent definition is to let $\sigma_P(p) = \sup_{F \in \mathcal{F}} \mathrm{dist}(p, F)/\mathrm{dist}(P, F)$, with
the understanding that when the denominator in the ratio is 0, the ratio itself is 0.

The total sensitivity of the instance $P$ is defined by $\mathfrak{S}_P := \sum_{p \in P} \sigma_P(p)$. The total
sensitivity function of the shape fitting problem is $\mathfrak{S}_n := \sup_{|P| = n} \mathfrak{S}_P$.

We now need a somewhat technical definition in order to be able to state an important
earlier result from [B]. On a first reading, the reader is welcome to skip the detailed definition.

► Definition 3 (The dimension of a shape fitting instance [6]). Let $P \subset \mathbb{R}^d$ be an instance of
a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$. For a weight function $w : P \to \mathbb{R}^+$, consider the set
system $(P, \mathcal{R})$, where $\mathcal{R}$ is a family of subsets of $P$ defined as follows: each element in $\mathcal{R}$ is a
set of the form $R_{F,r}$ for some $F \in \mathcal{F}$ and $r \ge 0$, and $R_{F,r} = \{p \in P \mid w_p \cdot \mathrm{dist}(p, F) \le r\}$.
That is, $R_{F,r}$ is the set of those points in $P$ whose weighted distance to the shape $F$ is at
most $r$. The dimension of the instance $P$ of the shape fitting problem, denoted by $\dim(P)$,
is the smallest integer $m$, such that for any weight function $w$ and $A \subseteq P$ of size $|A| = a \ge 2$,
we have: $|\{A \cap R_{F,r} \mid F \in \mathcal{F}, r \ge 0\}| \le a^m$.

For instance, in the $(j, k)$-projective clustering problem with the underlying distance
function dist being the $z$-th power of the Euclidean distance, the dimension $\dim(P)$ of any
instance $P$ is $O(jdk)$, independent of $|P|$ [B]. This is shown by methods similar to the ones
used to bound the VC-dimension of geometric set systems. In fact, this bound is the only
fact that we will need about the dimension of a shape fitting instance.

The following theorem recalls the connection established in [B] between coresets and
sensitivity via the above notion of dimension.
► Theorem 4 (Connection between total sensitivity and $\epsilon$-coreset [6]). Given any $n$-point
instance $P \subset \mathbb{R}^d$ of a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, and any $\epsilon \in (0, 1]$, there exists an
$\epsilon$-coreset for $P$ of size $O\left(\left(\frac{\mathfrak{S}_n}{\epsilon}\right)^2 \dim(P)\right)$.

Finally, we will need known bounds on the total sensitivity of the $(j, k)$-projective clustering
problem. These earlier bounds involve the dimension $d$ corresponding to the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$.

► Theorem 5 (Total sensitivity of $(j, k)$-projective clustering problem in fixed dimension [12]).
We have the following upper bounds of total sensitivities for the $(j, k)$-projective clustering
problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where dist is the $z$-th power of the Euclidean distance for $z \in (0, \infty)$.

• $j = 1$ ($k$-line center): $\mathfrak{S}_n$ is $O(k^{f(d,k)} \log n)$, where $f(d, k)$ is a function depending only
on $d$ and $k$.

• integer $(j, k)$-projective clustering problem: For any $n$-point instance $P$, with each coordinate
being an integer of magnitude at most $n^c$ for any constant $c > 0$, $\mathfrak{S}_P$ is
$O((\log n)^{f(d,j,k)})$, where $f(d, j, k)$ is a function depending only on $d$, $j$, and $k$.

3 Bounding the Total Sensitivity via Dimension Reduction 

In this section, we show that the total sensitivity of a point set $P$ is of the same order as
that of $\mathrm{proj}(P, F^*)$, which is the projection of $P$ onto an optimum shape $F^*$ from $\mathcal{F}$ fitting
$P$. This result captures the fact that the total sensitivity of a shape fitting problem quantifies
the complexity of shapes, in the sense that total sensitivity depends on the dimension of the
smallest subspace containing each shape, regardless of the dimension of the ambient space
where $P$ is from.

► Definition 6 (projection of points on a shape). For a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$,
define $\mathrm{proj} : \mathbb{R}^d \times \mathcal{F} \to \mathbb{R}^d$, where $\mathrm{proj}(p, F)$ is the projection of $p$ on a shape $F$, that
is, $\mathrm{proj}(p, F)$ is a point in $F$ which is nearest to $p$, with ties broken arbitrarily. That
is, $\mathrm{dist}(p, \mathrm{proj}(p, F)) = \min_{q \in F} \mathrm{dist}(p, q)$. We abuse the notation to denote the multi-set
$\{\mathrm{proj}(p, F) \mid p \in P\}$ by $\mathrm{proj}(P, F)$ for $P \subset \mathbb{R}^d$.

We first show that $\mathfrak{S}_P$ is $O(\mathfrak{S}_{\mathrm{proj}(P, F^*)})$, where $F^*$ is an optimum shape fitting $P$ from $\mathcal{F}$.
In particular, this implies that when $F^*$ is a low-dimensional object, the total sensitivity
of $P \subset \mathbb{R}^d$ can be upper bounded by the total sensitivity of a point set contained in a low
dimensional subspace.

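► Theorem 7 (Dimension reduction, computing the total sensitivity of a point set in high
dimensional space with the projected lower dimensional point set). Given an instance $P$ of a
shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, let $F^*$ denote a shape that minimizes $\mathrm{dist}(P, F)$ over all
$F \in \mathcal{F}$. Let $p'$ denote $\mathrm{proj}(p, F^*)$ and let $P'$ denote $\mathrm{proj}(P, F^*)$. Assume that the distance
function satisfies the relaxed triangle inequality: $\mathrm{dist}(p, q) \le \alpha(\mathrm{dist}(p, r) + \mathrm{dist}(r, q))$ for any
$p, q, r \in \mathbb{R}^d$ for some constant $\alpha \ge 1$. Then

1. the following inequality holds: $\mathfrak{S}_P \le 2\alpha^2 \mathfrak{S}_{P'} + \alpha$.

2. if $\mathrm{dist}(P, F^*) = 0$, then $\sigma_P(p) = \sigma_{P'}(p')$ for each $p \in P$. If $\mathrm{dist}(P, F^*) > 0$, then
$$\sigma_P(p) \le \alpha\,\frac{\mathrm{dist}(p, p')}{\mathrm{dist}(P, F^*)} + 2\alpha^2 \sigma_{P'}(p').$$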

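Proof. If $\mathrm{dist}(P, F^*) = 0$, then $P = P'$, and clearly both parts of the theorem hold.
Let us consider the case where $\mathrm{dist}(P, F^*) > 0$. By definition,

$$\sigma_P(p) = \inf\{\beta \ge 0 \mid \mathrm{dist}(p, F) \le \beta\,\mathrm{dist}(P, F), \forall F \in \mathcal{F}\},$$
$$\sigma_{P'}(p') = \inf\{\beta' \ge 0 \mid \mathrm{dist}(p', F) \le \beta'\,\mathrm{dist}(P', F), \forall F \in \mathcal{F}\}.$$

Let $F$ be an arbitrary shape in $\mathcal{F}$. Then we have

$$\begin{aligned}
\mathrm{dist}(p, F) &\le \alpha\,\mathrm{dist}(p, p') + \alpha\,\mathrm{dist}(p', F) \\
&\le \alpha\,\mathrm{dist}(p, p') + \alpha\,\sigma_{P'}(p')\,\mathrm{dist}(P', F) \\
&\le \alpha\,\mathrm{dist}(p, p') + 2\alpha^2 \sigma_{P'}(p')\,\mathrm{dist}(P, F) \\
&= \alpha\,\frac{\mathrm{dist}(p, p')}{\mathrm{dist}(P, F)} \cdot \mathrm{dist}(P, F) + 2\alpha^2 \sigma_{P'}(p')\,\mathrm{dist}(P, F) \\
&\le \alpha\,\frac{\mathrm{dist}(p, p')}{\mathrm{dist}(P, F^*)} \cdot \mathrm{dist}(P, F) + 2\alpha^2 \sigma_{P'}(p')\,\mathrm{dist}(P, F) \\
&= \left(\alpha\,\frac{\mathrm{dist}(p, p')}{\mathrm{dist}(P, F^*)} + 2\alpha^2 \sigma_{P'}(p')\right) \mathrm{dist}(P, F).
\end{aligned}$$

The first inequality follows from the relaxed triangle inequality, the second inequality follows
from the definition of sensitivity of $p'$ in $P'$, and the third inequality follows from the fact
that $\mathrm{dist}(P', F) = \sum_{p' \in P'} \mathrm{dist}(p', F) \le \sum_{p \in P} \alpha(\mathrm{dist}(p, F) + \mathrm{dist}(p, p')) = \alpha(\mathrm{dist}(P, F) +
\mathrm{dist}(P, F^*)) \le 2\alpha\,\mathrm{dist}(P, F)$, since $\mathrm{dist}(P, F^*) \le \mathrm{dist}(P, F)$. The last inequality uses
$\mathrm{dist}(P, F^*) \le \mathrm{dist}(P, F)$ once more.
Thus the second part of the theorem holds. Now,

$$\mathfrak{S}_P = \sum_{p \in P} \sigma_P(p) \le \sum_{p \in P} \left(\alpha\,\frac{\mathrm{dist}(p, p')}{\mathrm{dist}(P, F^*)} + 2\alpha^2 \sigma_{P'}(p')\right) = \alpha + 2\alpha^2 \mathfrak{S}_{P'},$$

which is the first part of the theorem. ◀
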

We make a remark regarding the value of $\alpha$ in Theorem 7 when the distance function is the
$z$-th power of Euclidean distance. It is used in Sections 4, 5, 6, and 7 when we derive upper
bounds of total sensitivities for various shape fitting problems.

► Remark (Value of $\alpha$ when $\mathrm{dist}(\cdot, \cdot) = (\|\cdot\|_2)^z$). Let $z \in (0, \infty)$. Suppose $\mathrm{dist}(p, q) =
(\|p - q\|_2)^z$. When $z \in (0, 1)$, the weak triangle inequality holds with $\alpha = 1$; when $z \ge 1$,
the weak triangle inequality holds with $\alpha = 2^{z-1}$. For a proof, see, for example, [5].
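
The values of $\alpha$ in the remark can be sanity-checked numerically. The Python sketch below samples random triples of points and reports the largest observed ratio $\mathrm{dist}(p, q)/(\mathrm{dist}(p, r) + \mathrm{dist}(r, q))$ for the $z$-th power of the Euclidean distance; the function names, the Gaussian sampling and the trial count are choices of this illustration only, and a random search is of course not a proof.

```python
import numpy as np

def alpha_for(z):
    """Constant from the remark: 1 for z in (0, 1), 2^(z-1) for z >= 1."""
    return 1.0 if z < 1 else 2.0 ** (z - 1)

def dist(p, q, z):
    """z-th power of the Euclidean distance."""
    return np.linalg.norm(p - q) ** z

def worst_ratio(z, trials=1000, dim=4, seed=0):
    """Largest dist(p,q) / (dist(p,r) + dist(r,q)) over random triples."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        p, q, r = rng.normal(size=(3, dim))
        worst = max(worst, dist(p, q, z) / (dist(p, r, z) + dist(r, q, z)))
    return worst
```

For every $z$, `worst_ratio(z)` should not exceed `alpha_for(z)`; the bound $2^{z-1}$ for $z \ge 1$ is approached when $r$ is the midpoint of $p$ and $q$.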

Theorem 7 bounds the total sensitivity of an instance $P$ of a shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$
in terms of the total sensitivity of $P'$. Suppose that there is an $m_2 \ll d$ so that each shape
$F \in \mathcal{F}$ is in some subspace of dimension $m_2$. In the $(j, k)$-projective clustering problem, for
example, $m_2 = k(j + 1)$. Then note that $P'$ is contained in a subspace of dimension $m_2$.
Furthermore, when dist is the $z$-th power of the Euclidean distance, it turns out that for many
shape fitting problems the sensitivity of $P'$ can be bounded as if the shape fitting problem
was housed in $\mathbb{R}^{2m_2}$ instead of $\mathbb{R}^d$. To see why this is the case for the $(j, k)$-projective
clustering problem, fix an arbitrary subspace $G$ of dimension $\min\{d, 2m_2\}$ that contains $P'$.
Then for any $F \in \mathcal{F}$, there is an $F' \in \mathcal{F}$ such that (a) $F'$ is contained in $G$, and (b)
$\mathrm{dist}(p', F') = \mathrm{dist}(p', F)$ for all $p' \in P'$.

The following theorem summarizes this phenomenon. For simplicity, it is stated for the
$(j, k)$-projective clustering problem, even though the phenomenon itself is somewhat more
general.

► Theorem 8 (Sensitivity of a lower dimensional point set in a high dimensional space). Let
$P'$ be an $n$-point instance of the $(j, k)$-projective clustering problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where
dist is the $z$-th power of the Euclidean distance, for some $z \in (0, \infty)$. Assume that $P'$ is
contained in a subspace of dimension $m_1$. (Note that for each shape $F \in \mathcal{F}$, there is a
subspace of dimension $m_2 = k(j + 1)$ containing it.) Let $G$ be any subspace of dimension
$m = \min\{m_1 + m_2, d\}$ containing $P'$; fix an orthonormal basis for $G$, and for each $p' \in P'$,
let $p'' \in \mathbb{R}^m$ be the coordinates of $p'$ in terms of this basis. Let $P'' = \{p'' \mid p' \in P'\}$, and
view $P''$ as an instance of the $(j, k)$-projective clustering problem $(\mathbb{R}^m, \mathcal{F}', \mathrm{dist})$, where $\mathcal{F}'$
is the set of all $k$-tuples of $j$-subspaces in $\mathbb{R}^m$, and dist is the $z$-th power of the Euclidean
distance. Then $\sigma_{P'}(p') = \sigma_{P''}(p'')$ for each $p' \in P'$, and $\mathfrak{S}_{P'} = \mathfrak{S}_{P''}$.



4 k-median/k-means Clustering Problem

In this section, we derive upper bounds for the total sensitivity function for the $k$-median/$k$-means
problems, and their generalizations, where the distance function is the $z$-th power of Euclidean
distance, using the approach in Section 3. These bounds are similar to the ones derived by
Langberg and Schulman [10], but the proof is much simplified. For the rest of the article,
dist is assumed to be the $z$-th power of the Euclidean distance.

► Theorem 9 (Total sensitivity of $(0, k)$-projective clustering). Consider the shape fitting
problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the set of all $k$-point subsets of $\mathbb{R}^d$. We have the following
upper bound on the total sensitivity:

$$\mathfrak{S}_n \le 2^{2z-1} k + 2^{z-1}, \quad z \ge 1,$$
$$\mathfrak{S}_n \le 2k + 1, \quad z \in (0, 1).$$

In particular, the total sensitivity of the $k$-median problem (which corresponds to the case
when $z = 1$) is at most $2k + 1$, and the total sensitivity of the $k$-means problem (which
corresponds to the case when $z = 2$) is at most $8k + 2$.

Proof. Let $P$ be an arbitrary $n$-point set. Apply Theorem 7, and note that $\mathrm{proj}(P, C^*)$,
where $C^*$ is an optimum set of $k$ centers, contains at most $k$ distinct points. Assume
that $C^* = \{c_1^*, c_2^*, \ldots, c_k^*\}$. Let $P_i$ be the set of points in $P$ whose projection is $c_i^*$, that is,
$P_i = \{p \in P \mid \mathrm{proj}(p, C^*) = c_i^*\}$. It is easy to see that the summation of sensitivities of the $|P_i|$
copies of $c_i^*$ is at most 1: for any $k$-point set $C$ in $\mathbb{R}^d$,

$$|P_i| \cdot \frac{\mathrm{dist}(c_i^*, C)}{\mathrm{dist}(\mathrm{proj}(P, C^*), C)} = \frac{|P_i|\,\mathrm{dist}(c_i^*, C)}{\sum_{l=1}^{k} |P_l|\,\mathrm{dist}(c_l^*, C)} \le 1.$$

Therefore, the total sensitivity of $\mathrm{proj}(P, C^*)$ is at most $k$. Substituting $\alpha$ from the
remark after Theorem 7, we get the above result. ◀

► Theorem 10 ($\epsilon$-coreset for $(0, k)$-projective clustering). Consider the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the set of all $k$-point subsets of $\mathbb{R}^d$. For any $n$-point instance $P$,
there is an $\epsilon$-coreset of size $O(k^3 d \epsilon^{-2})$.

Proof. Observe that $\dim(P)$ is $O(kd)$. Using Theorem 4 and Theorem 9, we obtain the
above result. ◀



5 k-line Clustering Problem

In this section, we derive upper bounds on the total sensitivity function for the $k$-line
clustering problem, that is, the $(1, k)$-projective clustering problem.

► Theorem 11 (Total sensitivity for $k$-line clustering problem). Consider the shape fitting
problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the set of $k$-tuples of lines. The total sensitivity function,
$\mathfrak{S}_n$, is $O(k^{f(k)} \log n)$, where $f(k)$ is a function that depends only on $k$.

Proof. Let $P$ be an arbitrary $n$-point set. Let $K^*$ denote an optimum set of $k$ lines fitting $P$.
Using Theorems 7 and 8, it suffices to bound the sensitivity of an $n$-point instance of a $k$-line
clustering problem housed in $\mathbb{R}^{4k}$. By Theorem 5, the total sensitivity of this latter shape
fitting problem is $O(k^{f(k)} \log n)$, where $f(k)$ is a function depending only on $k$. Therefore,
$\mathfrak{S}_n$ is $O(k^{f(k)} \log n)$.

(Alternatively, one could use a recent result in [5]. Let $P'$ denote the projection of $P$ onto
$K^*$. Since $K^*$ is a union of $k$ lines, we can upper bound the sensitivity of $P'$ by $k$ times
the sensitivity of an $n$-point set that lies on a single line. The sensitivity of an $n$-point set
that lies on a single line can be upper bounded by the sensitivity of an $n$-point set for the
weighted $(0, k)$-projective clustering problem, for which the sensitivity bound is $O(k^{f(k)} \log n)$
as shown in [5].) ◀

Notice that for the $k$-line clustering problem, the bound on the total sensitivity depends
logarithmically on $n$. We give below a construction of a point set that shows that this is necessary,
even for $d = 2$.

► Theorem 12 (The upper bound of total sensitivity for $k$-line clustering problem is tight). For
every $n \ge 2$, there exists an $n$-point instance $P$ of the $k$-line clustering problem $(\mathbb{R}^2, \mathcal{F}, \mathrm{dist})$,
where dist is the Euclidean distance, such that the total sensitivity of $P$ is $\Omega(\log n)$.

Proof. We construct a point set $P$ of size $n$, together with $n$ shapes $F_i \in \mathcal{F}$, $i = 1, \ldots, n$,
such that $\sum_{i=1}^{n} \mathrm{dist}(p_i, F_i)/\mathrm{dist}(P, F_i)$ is $\Omega(\log n)$. Note that this implies that $\mathfrak{S}_P$ is at least
$\Omega(\log n)$. Let $P$ be the following point set in $\mathbb{R}^2$: $p_i = (1/2^{i-1}, 0)$, for $i = 1, \ldots, n$. Let $F_i$ be
a pair of lines: one vertical line and one horizontal line, where the vertical line is the $y$-axis,
and the horizontal line is $\{(x, 1/2^i) \mid x \in \mathbb{R}\}$.

Consider the point $p_i$, where $i = 1, \ldots, n$. We show that $\mathrm{dist}(p_i, F_i)/\mathrm{dist}(P, F_i)$ is at
least $1/(2 + i)$, for $i = 1, \ldots, n$. For $j \le i$, note that $\mathrm{dist}(p_j, F_i) = 1/2^i$: since the distance
from $p_j$ to the horizontal line in $F_i$ is $1/2^i$ and the distance to the vertical line is $1/2^{j-1}$,
$\mathrm{dist}(p_j, F_i) = \min\{1/2^{j-1}, 1/2^i\} = 1/2^i$. For $i + 1 \le j \le n$, on the other hand, $\mathrm{dist}(p_j, F_i) =
1/2^{j-1}$. Therefore, $\sum_{j=i+1}^{n} \mathrm{dist}(p_j, F_i) = \sum_{j=i+1}^{n} 1/2^{j-1} = (1/2^{i-1}) \cdot (1 - (1/2)^{n-i})$. Thus,
we have

$$\sigma_P(p_i) = \sup_{F \in \mathcal{F}} \frac{\mathrm{dist}(p_i, F)}{\mathrm{dist}(P, F)} \ge \frac{\mathrm{dist}(p_i, F_i)}{\mathrm{dist}(P, F_i)} = \frac{1/2^i}{(1/2^{i-1} - 1/2^{n-1}) + i \cdot (1/2^i)} \ge \frac{1}{2 + i}.$$

Therefore, $\mathfrak{S}_P \ge \sum_{i=1}^{n} \sigma_P(p_i) \ge \sum_{i=1}^{n} \frac{1}{2 + i}$, which is $\Omega(\log n)$. ◀
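
The construction in this proof is easy to check numerically. The following Python sketch rebuilds the points $p_i = (1/2^{i-1}, 0)$ and the shapes $F_i$ (the $y$-axis together with the horizontal line at height $1/2^i$) and evaluates the ratios $\mathrm{dist}(p_i, F_i)/\mathrm{dist}(P, F_i)$. The function name is hypothetical, and the computation only yields a lower bound on the total sensitivity of this particular instance, since it evaluates one shape per point rather than the supremum.

```python
def theorem12_ratios(n):
    """Return dist(p_i, F_i) / dist(P, F_i) for i = 1..n, with Euclidean dist."""
    pts = [(2.0 ** -(i - 1), 0.0) for i in range(1, n + 1)]
    ratios = []
    for i in range(1, n + 1):
        h = 2.0 ** -i  # height of the horizontal line in F_i
        # Distance to F_i: the closer of the y-axis and the horizontal line.
        d = [min(abs(x), abs(y - h)) for x, y in pts]
        ratios.append(d[i - 1] / sum(d))
    return ratios
```

Each returned ratio should be at least $1/(2 + i)$, so their sum grows like a harmonic series in $n$.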

► Theorem 13 ($\epsilon$-coreset for $k$-line clustering problem). Consider the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the set of all $k$-tuples of lines in $\mathbb{R}^d$. For any $n$-point instance $P$,
there is an $\epsilon$-coreset with size $O(k^{f(k)} d (\log n)^2/\epsilon^2)$.

Proof. This result follows from Theorem 11, Theorem 4, and the fact that $\dim(P)$ in this
case is $O(kd)$. ◀

6 Subspace approximation 

In this section, we derive upper bounds on the sensitivity of the subspace approximation
problem, that is, the $(j, 1)$-projective clustering problem. For the applications of Theorems 7
and 8 in the other sections, we use existing bounds on the sensitivity that have a dependence
on the dimension $d$. For the subspace approximation problem, however, we derive here the
dimension-dependent bounds on sensitivity by generalizing an argument from [10] for the
case $j = d - 1$ and $z = 2$. This derivation is somewhat technical. With these bounds in hand,
the derivation of the dimension-independent bounds is readily accomplished in a manner
similar to the other sections.

distance. Although the size of the $\epsilon$-coreset obtained in this way is exponential in $j$,
which is larger than the size of the coreset in [5] [?] and Theorem ?? in this section, it is still
a constant (as $j$ is considered as a constant) and in particular, independent of the cardinality
of the input point set. It can be considered as a simple and straightforward way to see
why small $\epsilon$-coresets exist for $j$-subspace approximation problems.



6.1 Dimension-dependent bounds on Sensitivity 

We first recall the notion of an $(\alpha, \beta, z)$-conditioned basis from [S], and state one of its
properties (Lemma 15). We will use standard matrix terminology: $M_{ij}$ denotes the entry in
the i-th row and j-th column of M, and $M_{i\cdot}$ is the i-th row of M.

► Definition 14. Let $M$ be an $n \times m$ matrix of rank $\rho$. Let $z \in [1, \infty)$, and $\alpha, \beta \ge 1$. An
$n \times \rho$ matrix $A$ is an $(\alpha, \beta, z)$-conditioned basis for $M$ if the column vectors of $A$ span the
column space of $M$, and additionally $A$ satisfies: (1) $\sum_{i,j} |a_{ij}|^z \le \alpha^z$, (2) for all $u \in \mathbb{R}^\rho$,
$\|u\|_{z'} \le \beta \|Au\|_z$, where $\|\cdot\|_{z'}$ is the dual norm for $\|\cdot\|_z$ (i.e. $1/z + 1/z' = 1$).

► Lemma 15. Let $M$ be an $n \times m$ matrix of rank $\rho$. Let $z \in [1, \infty)$. Let $A$ be an $(\alpha, \beta, z)$-conditioned
basis for $M$. For every vector $u \in \mathbb{R}^m$, the following inequality holds:
$|M_{i\cdot} u|^z \le (\|A_{i\cdot}\|_z^z \cdot \beta^z) \|Mu\|_z^z$.

Proof. We have $M = A\tau$ for some $\rho \times m$ matrix $\tau$. Then,

$$|M_{i\cdot} u|^z = |A_{i\cdot} \tau u|^z \le \|A_{i\cdot}\|_z^z \cdot \|\tau u\|_{z'}^z \le \|A_{i\cdot}\|_z^z \cdot \beta^z \|A \tau u\|_z^z = \|A_{i\cdot}\|_z^z \cdot \beta^z \|Mu\|_z^z.$$

The second step is Hölder's inequality, and the third uses the fact that $A$ is $(\alpha, \beta, z)$-conditioned. ◀
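For z = 2 the inequality of Lemma 15 can be tested directly, because an orthonormal basis of the column space of M has orthonormal columns and so satisfies condition (2) of Definition 14 with β = 1. The sketch below is an illustration under that assumption only. It builds the basis by Gram-Schmidt, which is a choice made here and not a construction taken from the paper, and the names `orthonormal_column_basis` and `lemma15_holds` are hypothetical.

```python
import math


def orthonormal_column_basis(M, tol=1e-9):
    """Modified Gram-Schmidt on the columns of M (a list of rows).

    Returns A as a list of rows; the columns of A are orthonormal and span
    the column space of M, so A has rank(M) columns.
    """
    n, m = len(M), len(M[0])
    basis = []  # each entry is a length-n column vector
    for c in range(m):
        v = [M[r][c] for r in range(n)]
        for b in basis:
            coeff = sum(x * y for x, y in zip(v, b))
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol:  # skip columns that depend on earlier ones
            basis.append([x / norm for x in v])
    return [[b[r] for b in basis] for r in range(n)]


def lemma15_holds(M, A, u, tol=1e-9):
    """Check |M_i u|^2 <= ||A_i||_2^2 * ||M u||_2^2 for every row i (z = 2, beta = 1)."""
    Mu = [sum(a * b for a, b in zip(row, u)) for row in M]
    norm_sq = sum(x * x for x in Mu)
    return all(
        Mu[i] ** 2 <= sum(a * a for a in A[i]) * norm_sq + tol
        for i in range(len(M))
    )
```

In this special case the squared row norms of A sum to the rank of M, which corresponds to condition (1) of Definition 14 with α equal to the square root of the rank.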

Using Lemma [15] we derive an upper bound on the total sensitivity when each shape is a 
hyperplane. 

► Lemma 16 (total sensitivity for fitting a hyperplane). Consider the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$ where $\mathcal{F}$ is the set of all (d−1)-flats, that is, hyperplanes. The total sensitivity
of any n-point set is $O(d^{1+z/2})$ for $1 \le z < 2$, $O(d)$ for $z = 2$, and $O(d^z)$ for $z > 2$.

Proof. We can parameterize a hyperplane with a vector in $\mathbb{R}^{d+1}$, $u = [u_1, \ldots, u_{d+1}]$: the
hyperplane determined by $u$ is $h_u = \{x \in \mathbb{R}^d \mid \sum_{i=1}^{d} u_i x_i + u_{d+1} = 0\}$, where $x_i$ denotes the i-th
entry of the vector $x$. Without loss of generality, we may assume that $\sum_{i=1}^{d} u_i^2 = 1$. The Euclidean
distance to $h_u$ from a point $q \in \mathbb{R}^d$ is

$$\frac{\left|\sum_{i=1}^{d} u_i q_i + u_{d+1}\right|}{\sqrt{\sum_{i=1}^{d} u_i^2}} = \left|\sum_{i=1}^{d} u_i q_i + u_{d+1}\right|$$

(the equality follows from the assumption that $\sum_{i=1}^{d} u_i^2 = 1$).

Let $P = \{p_1, p_2, \ldots, p_n\} \subseteq \mathbb{R}^d$ be any set of n points. Let $\tilde{p}_i$ denote the row vector
$[p_i^T \; 1]$, and let $M$ be the $n \times (d+1)$ matrix whose i-th row is $\tilde{p}_i$. Then, $\mathrm{dist}(p_i, h_u) = |M_{i\cdot} u|^z$
(the z-th power of the Euclidean distance), and $\mathrm{dist}(P, h_u) = \sum_{i=1}^{n} |M_{i\cdot} u|^z = \|Mu\|_z^z$. Then using Lemma 15 we have

$$\sigma_P(p_i) = \sup_u \frac{|M_{i\cdot} u|^z}{\|Mu\|_z^z} \le \|A_{i\cdot}\|_z^z \cdot \beta^z,$$

where $A$ is an $(\alpha, \beta, z)$-conditioned basis for $M$. Thus
$\mathfrak{S}_P = \sum_{i=1}^{n} \sigma_P(p_i) \le \beta^z \sum_{i=1}^{n} \|A_{i\cdot}\|_z^z \le (\alpha\beta)^z$.
For $1 \le z < 2$, $M$ has a $((d+1)^{1/z+1/2}, 1, z)$-conditioned basis; for $z = 2$, $M$ has a $((d+1)^{1/2}, 1, z)$-conditioned
basis; for $z > 2$, $M$ has a $((d+1)^{1/z+1/2}, (d+1)^{1/z'-1/2}, z)$-conditioned basis
[S]. Thus the total sensitivities in the three cases are at most $(d+1)^{1+z/2}$, $d+1$, and $(d+1)^z$,
respectively. ◀
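The arithmetic behind the three cases can be sketched as follows. The first line combines Lemma 15 with condition (1) of Definition 14; the remaining lines substitute the parameters of the conditioned bases quoted in the proof and use $1/z + 1/z' = 1$. This is a worked check of the exponents, not an additional result.

```latex
\begin{align*}
\mathfrak{S}_P = \sum_{i=1}^{n} \sigma_P(p_i)
  &\le \beta^z \sum_{i=1}^{n} \|A_{i\cdot}\|_z^z
   = \beta^z \sum_{i,j} |a_{ij}|^z \le (\alpha\beta)^z, \\
1 \le z < 2:\quad (\alpha\beta)^z &= \bigl((d+1)^{1/z+1/2}\bigr)^z = (d+1)^{1+z/2}, \\
z = 2:\quad (\alpha\beta)^z &= \bigl((d+1)^{1/2}\bigr)^2 = d+1, \\
z > 2:\quad (\alpha\beta)^z &= \bigl((d+1)^{1/z+1/2}\,(d+1)^{1/z'-1/2}\bigr)^z
   = (d+1)^{z(1/z+1/z')} = (d+1)^z.
\end{align*}
```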

It is now easy to derive dimension-dependent bounds on the sensitivity when each shape 
is a j-subspace. 

► Corollary 17 (Total sensitivity for fitting a j-subspace). Consider the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$ where $\mathcal{F}$ is the set of all j-flats. The total sensitivity of any n-point set is
$O(d^{1+z/2})$ for $1 \le z < 2$, $O(d)$ for $z = 2$, and $O(d^z)$ for $z > 2$.

Proof. Denote by $\mathcal{F}'$ the set of hyperplanes in $\mathbb{R}^d$. Let $P \subseteq \mathbb{R}^d$ be an arbitrary n-point set.
We first show that $\sigma_{P,\mathcal{F}}(p) \le \sigma_{P,\mathcal{F}'}(p)$, where the additional subscript is being used to indicate
which shape fitting problem we are talking about (hyperplanes or j-flats). Let $p$ be an arbitrary
point in $P$. Let $F_p \in \mathcal{F}$ denote the j-subspace such that $\sigma_{P,\mathcal{F}}(p) = \mathrm{dist}(p, F_p)/\mathrm{dist}(P, F_p)$.
Let $\mathrm{proj}(p, F_p)$ denote the projection of $p$ on $F_p$. Consider the hyperplane $F'$ containing
$F_p$ and orthogonal to the vector $p - \mathrm{proj}(p, F_p)$. We have $\mathrm{dist}(p, F') = \mathrm{dist}(p, F_p)$, whereas
$\mathrm{dist}(q, F') \le \mathrm{dist}(q, F_p)$ for each $q \in P$. Therefore, $\sigma_{P,\mathcal{F}'}(p) \ge \mathrm{dist}(p, F')/\mathrm{dist}(P, F') \ge
\mathrm{dist}(p, F_p)/\mathrm{dist}(P, F_p) = \sigma_{P,\mathcal{F}}(p)$. It follows that $\mathfrak{S}_{P,\mathcal{F}} \le \mathfrak{S}_{P,\mathcal{F}'}$. The statement in the
corollary now follows from Lemma [16]. ◀
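The geometric step in this proof (replacing the j-subspace by a hyperplane that contains it) can be illustrated with a small sketch. It assumes, for simplicity, that the j-flat passes through the origin and is given by an orthonormal basis, and that dist is the z-th power of the Euclidean distance. The helper names `project`, `hyperplane_normal`, `dist_to_subspace` and `dist_to_hyperplane` are hypothetical and introduced here only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(p, basis):
    """Orthogonal projection of p onto span(basis); basis vectors are orthonormal."""
    out = [0.0] * len(p)
    for b in basis:
        c = dot(p, b)
        out = [o + c * x for o, x in zip(out, b)]
    return out


def dist_to_subspace(q, basis, z=2.0):
    """z-th power of the Euclidean distance from q to span(basis)."""
    r = [x - y for x, y in zip(q, project(q, basis))]
    return math.sqrt(dot(r, r)) ** z


def hyperplane_normal(p, basis):
    """Unit normal of the hyperplane through the origin that contains span(basis)
    and is orthogonal to p - proj(p, span(basis))."""
    r = [x - y for x, y in zip(p, project(p, basis))]
    norm = math.sqrt(dot(r, r))
    if norm == 0.0:
        raise ValueError("p lies on the subspace, so the hyperplane is not determined")
    return [x / norm for x in r]


def dist_to_hyperplane(q, normal, z=2.0):
    """z-th power of the Euclidean distance from q to the hyperplane {x : normal . x = 0}."""
    return abs(dot(q, normal)) ** z
```

For the point p used to build the normal, the two distances coincide; for any other point q the distance to the hyperplane can only be smaller, since the hyperplane contains the subspace.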

6.2 Dimension-independent Bounds on the Sensitivity 

We now derive dimension-independent upper bounds for the total sensitivity for the j-subspace 
fitting problem. 

► Theorem 18 (Total sensitivity for j-subspace fitting problem). Consider the shape fitting
problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$ where $\mathcal{F}$ is the set of all j-flats. The total sensitivity of any n-point set
is $O(j^{1+z/2})$ for $1 \le z < 2$, $O(j)$ for $z = 2$, and $O(j^z)$ for $z > 2$.

Proof. Use Theorem [7]; note that the projected point set $P'$ is contained in a j-subspace.
Further, each shape is a j-subspace. So, applying Theorem [8] and Corollary [17], the total
sensitivity is $O(j^{1+z/2})$ for $z \in [1, 2)$, $O(j)$ for $z = 2$ and $O(j^z)$ for $z > 2$. ◀

Using Theorem [18] and the fact that dim(P) for the j-subspace fitting problem is O(jd),
we obtain small ε-coresets:

► Theorem 19 (ε-coreset for j-subspace fitting problem). Consider the shape fitting problem
$(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$ where $\mathcal{F}$ is the set of all j-flats. For any n-point set, there exists an ε-coreset
whose size is $O(j^{3+z} d \varepsilon^{-2})$ for $z \in [1, 2)$, $O(j^3 d \varepsilon^{-2})$ for $z = 2$ and $O(j^{2z+1} d \varepsilon^{-2})$ for $z > 2$.

Proof. The result follows from Theorem [18] and Theorem [3]. ◀
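The exponents in Theorem 19 can be checked by a short calculation. The sketch below assumes that Theorem [3] bounds the coreset size by $O(\mathfrak{S}_P^2 \cdot \dim(P) / \varepsilon^2)$. This form is an assumption made for illustration, since the statement of Theorem [3] is not reproduced in this section; it is chosen because it matches the exponents stated above when combined with the total sensitivity bounds of Theorem 18 and $\dim(P) = O(jd)$.

```latex
\begin{align*}
z \in [1,2):\quad \bigl(j^{1+z/2}\bigr)^2 \cdot jd \cdot \varepsilon^{-2} &= j^{3+z}\, d\, \varepsilon^{-2}, \\
z = 2:\quad j^{2} \cdot jd \cdot \varepsilon^{-2} &= j^{3}\, d\, \varepsilon^{-2}, \\
z > 2:\quad \bigl(j^{z}\bigr)^2 \cdot jd \cdot \varepsilon^{-2} &= j^{2z+1}\, d\, \varepsilon^{-2}.
\end{align*}
```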

We note that for the case j = d − 1 and z = 2, a linear algebraic result from [2] yields a
coreset whose size is an improved $O(d \varepsilon^{-2})$.

7 The (j, k) integer projective clustering 

► Theorem 20. Consider the shape fitting problem $(\mathbb{R}^d, \mathcal{F}, \mathrm{dist})$, where $\mathcal{F}$ is the set of
k-tuples of j-flats. Let $P \subseteq \mathbb{R}^d$ be any n-point instance with integer coordinates, the magnitude
of each coordinate being at most $n^c$, for some constant c. The total sensitivity $\mathfrak{S}_P$ of P is
$O((\log n)^{f(k,j)})$, where f(k,j) is a function of only k and j. There exists an ε-coreset for P
of size $O((\log n)^{2f(k,j)} k j d \varepsilon^{-2})$.






Proof. Observe that the projected point set $P' = \mathrm{proj}(P, \{J_1^*, \ldots, J_k^*\})$, where $\{J_1^*, \ldots, J_k^*\}$
is an optimum k-tuple of j-flats fitting P, is contained in a subspace of dimension O(jk).
Using Theorem [5], Theorem [8] and Theorem [7], the total sensitivity $\mathfrak{S}_P$ is upper bounded by
$O((\log n)^{f(k,j)})$, where f(k,j) is a function of k and j. (A technical complication is that
the coordinates of $P'$, in the appropriate orthonormal basis, may not be integers. This can
be addressed by rounding them to integers, at the expense of increasing the constant c. A
similar procedure is adopted in [12], and we omit the details here.)

Using Theorem [4] and the fact that dim(P) is O(djk), we obtain the bound on the
coreset. ◀